Mass Spectrometry Data: An Introduction
Have you ever gotten your hands on a nice, big mass spectrometry dataset without being sure what to do with it? If so, this post may be for you. If not, come join the fun! The ProteomeTools dataset, composed of synthetic human peptides, or the more realistic MassIVE-KB dataset are good starting points. They have the nice properties of being large and of providing reliable ground-truth peptide identities for their spectra.
Context
Some brief context first. What I'm going to share is based on my experience during my PhD. I did not get any particular training in mass spectrometry or proteomics. While I am fairly confident in what I'm about to show you, I would appreciate it if any notes, corrections, addenda, or caveats were directed to mail@ on this site.
My background is entirely on the computational side, no biology here. Therefore, this introduction to mass spectrometry data analysis will follow a computational angle: less discussion of what it means, biologically speaking, and more about what to watch out for, how to evaluate the data, how to process it, and so forth.
Mass spectrometry is used in many other ways which will not be covered here: I assume a peptide identification workflow using tandem mass spectrometry in shotgun proteomics (i.e. we never go all the way to proteins, in part because the discussion of error control at the level of proteins is even more complicated than at the peptide level; and we don't assume quantitative proteomics either). To keep things simple, I'll also assume all the data uses higher energy collisional dissociation (HCD) fragmentation at a fixed normalized collision energy (NCE).
Wetlab
Figure 1 - A: A typical proteomics workflow. Proteins are collected. They are then enzymatically digested (typically by trypsin) to yield smaller peptides, which are in turn ionised (e.g. by electrospray (ESI)). The resulting charged particles are sent to the mass spectrometer to obtain mass spectra. In tandem mass spectrometry, the particles are broken down further, usually by collision with a neutral gas. This gives us a mass spectrum with, in theory, amino acid resolution. B: A schematic of a triple-quadrupole mass spectrometer for tandem MS.
The most common use case and workflow in mass spectrometry is for the identification of proteins from some source (often, a certain set of cells of interest, perhaps following some treatment, or being afflicted by some cancer or other). The process is illustrated in Figure 1. Since this post is focused on the computational side, I will only give a brief, focused overview of the steps involved here. Roughly,
  • Cells (or other protein-bearing specimen) are collected from some source
  • Proteins are extracted from the source
  • (Optional) Proteins are separated based on some feature, like mass range
  • Proteins are digested by some process, such as trypsin digestion, into peptides
  • (Optional) Peptides are separated based on some feature, like hydrophobicity or (again) mass range
  • Peptides are ionised and injected into the mass spectrometer
  • The mass spectrometer directs the charged particles to a detector, analyzes the raw readings, and outputs a mass spectrum
  • The data is analyzed by identification tools, etc.
Nowadays, as far as I'm aware, it's common to only separate after digestion, and separation is typically done on a high-performance liquid chromatography (HPLC) column. The last point is where computational people like me are supposed to come in, and we will mostly ignore all the other steps in this pipeline here, assuming they have all been taken care of correctly and our data is standardized on whatever works. This, of course, does not hold in real life. Here, we are interested in tandem mass spectrometry (also called MS/MS or MS2). MS1 (where peptides are not broken down further toward single amino acids) is generally not precise enough to identify peptides from their intact mass alone, due to the large potential for mass conflicts: with abstract masses, if 1+3 = 2+2, you can't tell apart the two 2-amino-acid sequences this could correspond to; and since you don't know the length in advance either, it could just as well be 1+1+1+1, and so on.
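To make that ambiguity concrete, here is a tiny sketch (Python, using monoisotopic residue masses): two glycines weigh the same as one asparagine, and glycine plus alanine the same as glutamine, so an intact mass alone cannot distinguish these sequences.

# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146,   # glycine
    "A": 71.03711,   # alanine
    "N": 114.04293,  # asparagine
    "Q": 128.05858,  # glutamine
}

def residue_sum(sequence):
    # Sum of residue masses; the intact peptide would add water (~18.01 Da).
    return sum(RESIDUE_MASS[aa] for aa in sequence)

print(residue_sum("GG"), residue_sum("N"))  # 114.04292 vs 114.04293
print(residue_sum("GA"), residue_sum("Q"))  # 128.05857 vs 128.05858
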
What's in a Spectrum?
Figure 2 - A mass spectrum (A) and typical processing steps (B, C). The "theoretical" spectrum from the peptide sequence is shown (D) and overlaid on the experimental peak sequence (E) for visualization purposes. The datasets we have access to typically only reliably provide us the fully cleaned up version of the spectrum (E).
First, let's talk about what our data, that is, our spectra, actually are. A mass spectrum is a set of mass-to-charge ratios (m/z, whose unit in the literature is taken to be either the Dalton or the Thomson (Dalton divided by charge); the two are used interchangeably in practice) with associated intensities (arbitrary units). For peptide identification, our goal is to take this sequence of m/z-intensity pairs and turn it into a sequence of amino acids, i.e. the peptide's primary sequence. A realistic mass spectrum and common processing steps are shown in Figure 2. In-silico spectrum generation (the "theoretical spectrum") will be discussed shortly, but notice how few of its peaks match those found in the experimental spectrum. This is rather typical in realistic datasets and serves as a clue that (confident) identification is not a walk in the park. Typically, datasets available in the wild only reliably provide a highly processed spectrum as shown in Figure 2, not the actual raw signal. This is called a "centroided" spectrum, as opposed to "profile mode", which samples the signal at regular intervals and records everything. The centroided spectrum is also sometimes called simply a "peak list" or a "peak-picked" spectrum. A common paradigm is to save MS1 data in profile mode, but MS2 (i.e. tandem) data in centroided mode.

After the wetlab part of a proteomics workflow, we're typically left with proprietary vendor-specific files, such as Thermo Fisher .RAW or AB SCIEX .WIFF. A lot of people publish those proprietary files instead of doing the conversion first, so the first step is to get the data into shape. The main options are mzXML (an XML-based format, which I wouldn't recommend) or MGF (the Mascot generic format, sometimes also called Mascot general format, although that's incorrect).
The MGF format is not standardized, but still routinely used. It's rather easy to work with, easy to read and even easy to edit by hand. It looks like this:

BEGIN IONS
CHARGE=3+
SEQ=PEPTIDE
SCAN=1435
SCANS=1435
TITLE="mgf_library_17/first_pool/f1000.raw"
123.123 99999.123123
234.234 8980.1727
...
1024.554 324.4
END IONS


Features of interest include:
  • BEGIN IONS and END IONS delimit each spectrum. I have never seen an mgf file fail to write these in exactly the same way.
  • Field order and field types: neither is defined and they vary a lot. One source may have the SEQ= field, another will not. One may call their SEQ= field SEQUENCE= instead, and so forth.
  • Redundant fields: SCAN= and SCANS= for instance. Inconsistent values in fields: CHARGE=3+ may be spelled CHARGE=3 (no +) in another source.
  • Inconsistent separators: some mgf files use spaces to separate m/z and intensity values, others use tabs. Some files will also have blank lines all over, while others will have none.
That said, fields are consistent between entries: whatever set of fields and value formats is used in the first spectrum will typically (as good as "always" in practice) be the same in all other spectra in the dataset. That still leaves a lot of room for complications if one is not careful, for example when developing on one dataset and testing on a different one. Nevertheless, mgf remains, in my opinion, the superior option, as it optimizes for practical concerns like compatibility and ease of processing. There are a few freeware and open-source tools available online to convert from proprietary formats to mgf, including e.g. RawConverter or ThermoRawFileParser for Thermo .RAW files.
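To give an idea of how little machinery is needed, here is a minimal, deliberately lenient parser sketch (the function name parse_mgf and its behavior are mine, not from any particular library); it accepts spaces or tabs as separators and skips blank lines:

def parse_mgf(path):
    # Very lenient MGF reader: yields one dict per BEGIN IONS / END IONS block.
    # Header fields go into "params"; (m/z, intensity) pairs go into "peaks".
    spectrum = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # some files are littered with blank lines
            if line == "BEGIN IONS":
                spectrum = {"params": {}, "peaks": []}
            elif line == "END IONS":
                yield spectrum
                spectrum = None
            elif spectrum is not None:
                if "=" in line and not line[0].isdigit():
                    key, _, value = line.partition("=")
                    spectrum["params"][key.upper()] = value
                else:
                    # Peak line: m/z and intensity, space- or tab-separated.
                    mz, intensity = line.split()[:2]
                    spectrum["peaks"].append((float(mz), float(intensity)))

# Usage (hypothetical file name):
# spectra = list(parse_mgf("f1000.mgf"))
# print(len(spectra), spectra[0]["params"].get("CHARGE"))
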

Basic QA
Figure 3 - Some statistics of the ProteomeTools data. A: Precursor mass vs peptide length. B: Distribution of peptide lengths. C: Precursor mass vs Andromeda (MaxQuant's peptide search engine)'s score for the peptide-spectrum match (PSM). D: Peptide length vs Andromeda score.
If all went well and the files are usable, spectra should contain, at least, the precursor mass (that is, the mass of the whole peptide whose fragments are represented by the spectrum we are currently looking at), its charge, and non-empty m/z-intensity peak lists. If any of these are missing, the data will be of limited use. With an ESI+HCD process on common instruments, most spectra should be of charge 2+ or 3+. The precursor neutral mass distribution (that's the precursor m/z corrected for precursor charge) can also help. For human peptides, the range should be around 700-5000 Da with an average around 1500 Da and a heavy-tailed distribution. This will not necessarily hold for every sample type: for example, if the data comes only from cancer cell lines. Another way to assess quality is to compare replicates (e.g. by some kind of correlation metric, like Pearson correlation or cosine similarity) when such information is available. If not, one can compare spectra with sequential scan numbers, as they're the most likely to be correlated. When replicates are identified, a cosine similarity of about 70% is quite strong. A value closer to 50% is not too shabby depending on the nature of the dataset, but it's usually cutting it close. The data can also be thrown at a peptide identification program like Comet (provided it was calibrated correctly) to see the distribution of scores. If the software can't identify anything, the problem is likely with the data rather than the software (obviously, this depends on the nature of the experiment that generated the data). Figure 3 shows some basic statistics from a subset of ProteomeTools. Since it is a dataset of synthetic peptides, its quality is unusually high.
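To make the replicate comparison concrete, here is a sketch of a cosine similarity between two peak lists; the 1 Da bin width is an arbitrary choice of mine, and real pipelines often use finer bins or tolerance-based peak matching.

import math
from collections import defaultdict

def binned(peaks, bin_width=1.0):
    # Sum intensities into m/z bins of the given width.
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz / bin_width)] += intensity
    return bins

def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    a, b = binned(peaks_a, bin_width), binned(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two replicate spectra scoring around 0.7 would be considered quite strong.
# print(cosine_similarity(spectrum_1["peaks"], spectrum_2["peaks"]))
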

Basic Cleanup
Empty or near-empty spectra can fairly safely be discarded in most cases; it won't be possible to get much out of them unless, perhaps, the proteome under study is very sparse. It's also common to remove spectra with very few or very many peaks (through an arbitrary threshold, like ≤ 20 or ≥ 500), as they are likely to be pure noise. The remaining spectra are also often intensity-culled: the intensity values are divided by the maximum intensity in each spectrum, and peaks whose intensity does not exceed, for example, 1% or 5% of that maximum (i.e., values below 0.01 or 0.05 after the division) are removed. Sometimes, a transform such as log or square root is applied either before or after the cull. Another common processing step is to correct the precursor charge reported by the instrument. This is done by locating peaks which are 1/z apart (in m/z) for some charge level z, the spacing expected between carbon isotope peaks under the hypothesis that the peptide has charge z. Depending on the dataset, the charge of 1-10% of the spectra may have been incorrectly assigned. This can also be spotted when high-intensity peaks sit far below or above the precursor m/z reported in the file. A related transformation is the removal of isotope peaks, called "deisotoping". However, in my experience, deisotoping has a high false positive rate. The result is reduced identification rates on cleaner datasets (though it does tend to help on dirtier data). When sequence identities are available, a dataset can further be quality-controlled by removing spectra whose precursor exact mass is too far from the theoretical mass of the sequence. This can be seen as a more refined version of charge correction (since the exact mass is usually wrong on account of an incorrect charge in the spectrum file).
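As a sketch of the peak-count filter and intensity cull described above (the thresholds and the optional square-root transform are illustrative defaults, not recommendations):

import math

def clean_spectrum(peaks, min_peaks=20, max_peaks=500, rel_threshold=0.01, sqrt_transform=False):
    # Reject spectra with too few or too many peaks.
    if not peaks or len(peaks) <= min_peaks or len(peaks) >= max_peaks:
        return None
    base = max(intensity for _, intensity in peaks)
    cleaned = []
    for mz, intensity in peaks:
        rel = intensity / base  # normalize to the most intense peak
        if rel >= rel_threshold:
            cleaned.append((mz, math.sqrt(rel) if sqrt_transform else rel))
    return cleaned
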
In deep learning, a common transformation is to "rasterize" the spectrum into a fixed-length vector (for example, a vector where each element represents a further 0.1 Da increment, so the intensities of the peaks at 1193.12 and 1193.14 would be summed together into vector element 11931 (0-based)), which amounts to approximately inverting the peak-picking transform (without recovering any of the data lost in the peak-picking process)... Would such algorithms perform well on profile-mode spectra? If only there were a way to find that out!
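And a sketch of the rasterization itself, under the assumed convention that element i covers the m/z range from i times 0.1 up to (i+1) times 0.1:

def rasterize(peaks, bin_width=0.1, max_mz=2000.0):
    # Turn a peak list into a fixed-length intensity vector, summing peaks
    # that fall into the same m/z bin.
    vector = [0.0] * int(max_mz / bin_width)
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if index < len(vector):
            vector[index] += intensity
    return vector

# 1193.12 and 1193.14 both map to index 11931 and are summed together.
# vec = rasterize(spectrum["peaks"], max_mz=4000.0)
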
Generating a Spectrum
Figure 4 - An example peptide and its labeled fragments. The x, y, and z series are numbered from the C-terminal end of the peptide; the a, b, and c series from the N-terminal end.
Now that we know what the data looks like and we have it in a usable format, what shall we do with it?
We are interested in peptide identification, and the state of the art is not much more than in-silico digestion of candidate proteins, filtering to the spectrum's precursor mass (within tolerance), and some kind of correlation metric. Figure 4 shows a typical peptide (the R# are the "residues" attached to the "peptide backbone"; an amino acid's name identifies this so-called residue in particular). The physical model of fragmentation typically considered admits fragmentation events at specific bonds of the molecule, as depicted in the image. Those bonds are labeled based on their location and order. With HCD, the main fragments are expected to be of type b or y. The ions generated by fragmentation following such a pattern form a series, so here we are interested in the "b-ion series" and the "y-ion series". More complex models also consider so-called neutral losses, the loss of uncharged molecules from the fragment. However, their consideration is still not all that common, and in my experience, they don't tend to be any more likely to be present in the spectrum (on average) than the b- and y-ions themselves (refer to Figure 2), so to me the added noise does not seem worth it. On the flip side, we now have impressive deep learning models like PredFull, which generate very accurate predictions for the entire spectrum (not just those series and/or neutral losses). A major caveat is that this kind of model does not seem to generalize too well to other species yet. But I digress.

To generate a simple b- and y-ion series theoretical spectrum, the first step is to acquire a protein database. A common resource to find one is the UniProt database. The next step is to do in-silico digestion. If we're using the typical trypsin, the regular expression rule (as per the Expasy PeptideCutter tool) is ([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P)), which means, roughly, "cut after lysine or arginine, but not if followed by proline, except if it's a lysine preceded by a tryptophan or an arginine preceded by a methionine (in which case, do cut despite the proline)". Applying this rule gives us all potential digestion sites, but real life does not work like that, and trypsin will not fully digest everything, for a variety of reasons. To generate our spectra, we have to consider at least missed cleavages, that is, the sites that don't get digested. It's just a matter of sticking consecutive fragments together after a full theoretical digestion. It's typical to consider up to 2 missed cleavages, which means we have to add to our peptide pool not just the fully digested products, but also all pairs and all triplets of digestion products that are consecutive along the protein sequence. Not covered here are further complications, like semi-tryptic digestion, post-translational modifications (PTMs) such as oxidation, fixed modifications like carbamidomethyl cysteine (CAM), or more complicated models including neutral losses and other events.
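A sketch of this digestion step, using the Expasy rule above and stitching consecutive fragments back together for missed cleavages (the length limits are arbitrary but common choices of mine):

import re

# Trypsin rule from Expasy PeptideCutter: cut after K or R unless followed by P,
# except after WK or MR, which are cut even before a P.
TRYPSIN = re.compile(r"([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))")

def digest(protein, missed_cleavages=2, min_length=6, max_length=40):
    # Positions right after each cleavage site, plus the protein ends.
    cuts = [m.end() for m in TRYPSIN.finditer(protein)]
    boundaries = [0] + cuts + [len(protein)]
    fragments = [protein[i:j] for i, j in zip(boundaries, boundaries[1:]) if i < j]
    peptides = set()
    for start in range(len(fragments)):
        # Stitch together 1, 2, ..., missed_cleavages + 1 consecutive fragments.
        for end in range(start + 1, min(start + missed_cleavages + 2, len(fragments) + 1)):
            peptide = "".join(fragments[start:end])
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return peptides

# print(digest("MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"))  # a toy example
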

Now, all we have to do is compute the mass sequence resulting from a candidate peptide sequence at a given charge level (remember: we need m/z, not just m). This is simple: we sum the individual amino acid residue masses, the mass of the two peptide termini, and any mass changes due to post-translational modifications, and then we deal with the charge. Given that to be of charge z, a peptide must carry z-1 extra protons compared to the charge 1 molecule, we have to remember to account for that mass before dividing by the charge to obtain our m/z. Mind the mass difference between the resulting calculations for the y- and b-series (or, if used, other series like the a, z, c or x). Peptide search engines, various utilities, and online resources are all available to compare results against, such as Protein Prospector's MS-Product.
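Here is a sketch of that calculation for the b- and y-series at a given fragment charge, using monoisotopic residue masses; comparing its output for a toy peptide against a tool like MS-Product is a good sanity check.

# Monoisotopic residue masses (Da), unmodified cysteine.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

PROTON = 1.00728  # mass added per unit of charge
WATER = 18.01056  # the two peptide termini (H and OH) combined

def fragment_mz(sequence, charge=1):
    # Cumulative masses of the first 1..n-1 residues (the b-ion "prefixes").
    prefix = [sum(RESIDUE[aa] for aa in sequence[:i]) for i in range(1, len(sequence))]
    total = sum(RESIDUE[aa] for aa in sequence)
    b_ions = [(m + charge * PROTON) / charge for m in prefix]
    y_ions = [(total - m + WATER + charge * PROTON) / charge for m in prefix]
    return b_ions, sorted(y_ions)

# b, y = fragment_mz("PEPTIDE")  # b2 should be ~227.103, y1 ~148.060
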
Conclusion
Now that you have your data and your theoretical spectra, peptide identification is simply about selecting the subset of peptide sequences in your database that are relevant to the current spectrum based on mass and charge, then computing a similarity score. Designing the right score is an active area of research. Current search engines are not designed with computational research in mind, which means that any improvement has to fit inside the box set up by another engine, or must be implemented in a novel search engine. I believe that this is a major hurdle for modern proteomics as a whole, which is why a significant portion of my PhD was spent developing a research-first platform (which ended up running faster than common engines like X!Tandem and, if one elects to eschew the speed, can generate 20% more identifications than Comet at the same controlled error rate in early tests).
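For illustration only, the simplest possible such score is a shared peak count: how many theoretical m/z values are matched by an observed peak within some tolerance. Real engines use far more refined statistics (e.g. Comet's XCorr or X!Tandem's hyperscore), but the skeleton looks like this:

import bisect

def shared_peak_count(observed_mzs, theoretical_mzs, tolerance=0.02):
    # Toy similarity score: the number of theoretical m/z values matched by
    # at least one observed peak within `tolerance` Da.
    observed = sorted(observed_mzs)
    matched = 0
    for mz in theoretical_mzs:
        i = bisect.bisect_left(observed, mz - tolerance)
        if i < len(observed) and observed[i] <= mz + tolerance:
            matched += 1
    return matched
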

The last part of the story (although there are some interludes, such as rescoring, if one is so inclined) is error control. The datasets I linked to at the beginning of this post have ground truth peptide sequences available, which makes evaluating search engine improvements straightforward in this context. But how can we get confident identifications from our scoring function on a novel sample set, when we don't know what the actual peptides in the sample are? The common approach today is to search not just the digested proteins, but also the digested inverted proteins (there are other approaches, but they are generally less robust in terms of error control). Peptides generated this way that already occur in the real candidate database are removed, and the remaining "decoy" peptides are supposed to be similar enough to the real ones that the search engine will distribute its false hits uniformly at random between the target and decoy databases. Thus, we can use the formula (decoy hits) / (true hits + false hits), under the assumption that decoy hits = false hits (making it equivalent to (false hits) / (true hits + false hits)), to estimate the false discovery rate (FDR). This is called the "target-decoy approach" or TDA. Unfortunately, this method is not robust and frequently overestimates the true false discovery rate by factors as high as 5 to 10. Moreover, it prevents comparing identification results between databases, because the distribution of their hits differs and they over- or underestimate false discovery in completely different ways.
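In code, the TDA estimate is little more than a ratio over the scored peptide-spectrum matches (a sketch; I assume hits come as (score, is_decoy) pairs and the threshold is whatever score cutoff is being evaluated):

def estimated_fdr(hits, threshold):
    # Target-decoy FDR estimate: decoy hits / target hits among PSMs scoring
    # at or above `threshold`. `hits` is an iterable of (score, is_decoy) pairs.
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    return decoys / targets if targets else 0.0
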

Unfortunately, not only is this method widespread, but modern research efforts are widely directed at breaking the required "decoy hits = false hits" assumption in various ways. I instead propose that, at the very least, any newly proposed identification approach must be evaluated in terms of its true false discovery rate (sometimes referred to as the "factual FDR" or, more commonly, as the false discovery proportion (FDP)).

Perhaps rescoring or TDA-FDR will be the topic of another post. In any case, you are now armed with enough knowledge to be dangerous with mass spectrometry data. Watch your step, good luck, and happy hacking!

See my followup post for a worked example of a peptide identification workflow.
2023-06-18 - 2023-06-21